Abstract

We extend a result of Goldreich and Ron about estimating the collision probability of
a hash function. Their estimate has a polynomial tail. We prove that when the load factor
is greater than a certain constant, the estimator has a Gaussian tail. As an application we
find an estimate of an upper bound for the average search time in hashing with chaining,
for a particular user (we allow the overall key distribution to be different from the key
distribution of a particular user). This estimator also has a Gaussian tail.

1 Introduction 

Hash tables have many applications in computer science [1], [3]. We especially mention data
bases, where hash tables are used for storing values of an attribute; see chapter 12 of [4].
Following the notation of [1], a hash function is a function $h : U \to T$, where both the domain
$U$ and the range $T$ are finite. Traditionally, $U$ is called the key space or the "universe", and
elements $x \in U$ are called keys. The set $T$ is called the table, and its elements are called
the table slots. When $h(x) = i$ we say that $h$ hashes the key $x$ into the slot $i$. We shall denote
by $n$ the cardinality of $T$ and we will simply assume that $T = \{1, \ldots, n\}$. We assume that $U$
is (very much) larger than $T$.

We assume that a probability measure $q$ has been defined on $U$. The probability of $S$ ($\subseteq U$)
is denoted by $P(S)$ ($= \sum_{x \in S} q(x)$). We also put the product measure on $U \times U$ and on $U^m$
(for any positive integer $m$); using the product measure amounts to saying that in a sequence
of $m$ keys, all the keys are independent.

The probability on $U$ induces a probability measure on $T$: The probability that some key
hashes to slot $i$ ($\in T$) is $p_i = \sum_{x \in h^{-1}(i)} q(x) = P(h^{-1}(i))$.

If two keys $x_1, x_2 \in U$ have the same hash value, these keys are said to collide. The collision
probability of the hash function $h$ is defined to be $P\{(x_1, x_2) \in U \times U : h(x_1) = h(x_2)\}$ (in
short-hand this is denoted by $P(h(x_1) = h(x_2))$). Here we use the product measure (i.e., keys
are "chosen independently"). A true collision corresponds to keys $x_1, x_2 \in U$ such that $x_1 \neq x_2$
and $h(x_1) = h(x_2)$.

Throughout this paper, $\|\cdot\|$ denotes the Euclidean norm. It is straightforward to prove the
following.

Proposition 1.1 The collision probability of $h$ is equal to $\sum_{i=1}^{n} p_i^2$ ($= \|p\|^2$).
Moreover, we always have $\sum_{i=1}^{n} p_i^2 \geq \frac{1}{n}$, and equality holds iff $p_i = \frac{1}{n}$ for all $i \in T$.

Similarly, the probability that two independently chosen keys are equal is $\sum_{x \in U} q(x)^2$. Hence,
the probability of true collisions for $h$ is $\sum_{i=1}^{n} p_i^2 - \sum_{x \in U} q(x)^2$.

Note that $\sum_{x \in U} q(x)^2$ will usually be very small assuming that $U$ is very large (compared
to $n$ and compared to the length $m$ of key sequences used), and assuming that the probability
distribution $q$ on $U$ is not very concentrated. Therefore, the difference between the collision
probability $\|p\|^2$ and the probability of true collisions is usually quite small.
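The quantities in Proposition 1.1 can be computed directly on a small instance. The following is a minimal sketch, assuming a toy universe $U = \{0, \ldots, 11\}$ with the uniform measure and the hypothetical hash function `key % n + 1`; neither comes from the text, and the function names are illustrative only.

```python
from fractions import Fraction


def slot_distribution(q, h, n):
    """Return [p_1, ..., p_n]: p_i is the total q-mass of the keys hashed to slot i."""
    p = [Fraction(0)] * n
    for key, prob in q.items():
        p[h(key) - 1] += prob  # slots are numbered 1..n, as in the text
    return p


def collision_probability(p):
    """Collision probability ||p||^2 = sum of p_i^2 (Proposition 1.1)."""
    return sum(pi * pi for pi in p)


def true_collision_probability(q, p):
    """||p||^2 minus the probability that two independent keys are equal."""
    return collision_probability(p) - sum(prob * prob for prob in q.values())


# Toy instance (an assumption for illustration): 12 equally likely keys, 4 slots.
n = 4
q = {key: Fraction(1, 12) for key in range(12)}
h = lambda key: key % n + 1
p = slot_distribution(q, h, n)
print(collision_probability(p), true_collision_probability(q, p))
```

With this perfectly balanced hash function the collision probability attains the lower bound $1/n$; any unbalanced assignment of keys to slots gives a strictly larger value.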

In this paper we assume that collisions are resolved by some form of chaining; i.e., all the
keys that are hashed into one slot are stored in that slot. For a hash table with chaining, we
will simply assume that the search time (for both successful and unsuccessful search) in a slot
$i$ is proportional to the number of keys stored in that slot; for simplicity, we identify
search time in a slot and chain length in the slot.

Notation $k_i(x)$: Let $x = (x_1, \ldots, x_m)$ be a sequence of $m$ keys that are inserted into our
hash table, and let $i$ be a slot ($i = 1, \ldots, n$). We let $k_i(x)$ denote the number of keys (counted
with multiplicities) inserted into slot $i$. ("With multiplicities" means that if a key occurs several
times in $x$ it is counted as many times as it occurs.)

Since in $k_i(x)$ we count keys with multiplicities, $k_i(x)$ is an upper bound on the number of
different keys stored in slot $i$.

Proposition 1.2 For a sequence of keys $x = (x_1, \ldots, x_m)$ that are inserted, the number of
collisions between keys in $x$ is

\[ \sum_{i=1}^{n} \frac{k_i(x)(k_i(x) - 1)}{2}. \]

The proof is straightforward. Recall that we count pairs of equal keys in the sequence $x$ as
collisions. Since there are $\frac{m(m-1)}{2}$ unordered pairs of key insertions in $x$, we call

\[ \sum_{i=1}^{n} \frac{k_i(x)(k_i(x) - 1)}{m(m-1)} \]

the empirical collision probability of $x$. This concept, and its relation with the collision
probability $\|p\|^2$, were first studied by Goldreich and Ron [2].
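As an illustration of Proposition 1.2, the empirical collision probability can be computed either from the slot counts $k_i(x)$ or by counting colliding pairs directly. The sketch below assumes a hypothetical hash function `key % 4 + 1` and an arbitrary short key sequence; both are illustrative choices, not taken from the text.

```python
from collections import Counter
from itertools import combinations


def empirical_collision_probability(keys, h):
    """Sum over slots of k_i (k_i - 1), divided by m (m - 1)."""
    m = len(keys)
    counts = Counter(h(x) for x in keys)  # k_i(x), counted with multiplicities
    return sum(k * (k - 1) for k in counts.values()) / (m * (m - 1))


def colliding_pair_fraction(keys, h):
    """Fraction of unordered pairs of insertions that land in the same slot."""
    pairs = list(combinations(range(len(keys)), 2))
    hits = sum(1 for a, b in pairs if h(keys[a]) == h(keys[b]))
    return hits / len(pairs)


# Illustrative data: the key 7 occurs twice and is counted twice.
keys = [3, 7, 7, 10, 14, 21]
h = lambda key: key % 4 + 1
print(empirical_collision_probability(keys, h), colliding_pair_fraction(keys, h))
```

Both functions return the same value, since each unordered colliding pair is counted once among the $k_i(k_i - 1)/2$ pairs of its slot.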
In this paper we obtain two results, in the form of deviation bounds. (1) We give an
estimation of the collision probability. (2) We give a deviation bound for an upper bound on
the average search time.

In the second result we assume that the load factor is > 9 (see later for the exact assumptions).
Applications in data bases often lead to hash tables with large load factor ([4], Chapter
12). We allow arbitrary key distributions.



Estimation of the collision probability

Our first result extends a result of Goldreich and Ron, namely that $\sum_{i=1}^{n} \frac{k_i(x)(k_i(x)-1)}{m(m-1)}$ is
a very good estimator for the collision probability $\|p\|^2$. How good the estimator is can be
measured by the relative error $\left| \sum_{i=1}^{n} \frac{k_i(x)(k_i(x)-1)}{m(m-1)} \cdot \frac{1}{\|p\|^2} - 1 \right|$. Their result, as well as ours, gives
a deviation bound for this relative error. Goldreich and Ron [2] proved a polynomial deviation
bound for the estimator $\sum_{i=1}^{n} \frac{k_i(x)(k_i(x)-1)}{m(m-1)}$. Their goal was to find sublinear-time algorithms
for testing expansion properties of bounded-degree graphs.

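Theorem 1.3 (Goldreich and Ron [2]). For all $\beta > 0$, $\lambda > 0$, if $m = n^{1/2+\beta+\lambda}$ then

\[ P\left\{ \left| \sum_{i=1}^{n} \frac{k_i(x)(k_i(x) - 1)}{m(m-1)} \cdot \frac{1}{\|p\|^2} - 1 \right| < \frac{3}{n^{\beta/2}} \right\} > 1 - \frac{c}{9 n^{\lambda}}, \]

where $c$ is the absolute constant in the numerator of the bound of [2].

We extend the theorem of Goldreich and Ron as follows: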



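Theorem 1.4 For all $n > 24$, $\frac{1}{3} > \varepsilon > 0$, $\delta > 0$, $s > 0$, if $m = \varepsilon^{-2} n^{1+\delta}$ we have

\[ P\left\{ \left| \sum_{i=1}^{n} \frac{k_i(x)(k_i(x) - 1)}{m(m-1)} \cdot \frac{1}{\|p\|^2} - 1 \right| < \varepsilon \left( 3 + \frac{6s}{n^{\delta/2}} + \frac{5 s^2 \varepsilon}{n^{\delta}} \right) \right\} > 1 - \frac{10}{9} e^{-s^2/4}. \]

By taking $s = 2 n^{\delta/2}$, the expression $3 + \frac{6s}{n^{\delta/2}} + \frac{5 s^2 \varepsilon}{n^{\delta}}$ becomes $3 + 12 + 20\varepsilon$ ($< 22$); here we use
$\varepsilon < \frac{1}{3}$. Therefore,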

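Corollary 1.5 For all $n > 24$, $\frac{1}{3} > \varepsilon > 0$, $\delta > 0$, if $m = \varepsilon^{-2} n^{1+\delta}$ we have

\[ P\left\{ \left| \sum_{i=1}^{n} \frac{k_i(x)(k_i(x) - 1)}{m(m-1)} \cdot \frac{1}{\|p\|^2} - 1 \right| < 22\varepsilon \right\} > 1 - \frac{10}{9} e^{-n^{\delta}}. \]

Writing $\delta = \frac{\log C}{\log n}$ for $C > 1$, we obtain $n^{\delta} = C$, and $m = \varepsilon^{-2} C n$, i.e., the load factor is
$L = C \varepsilon^{-2}$. Therefore,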

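Corollary 1.6 For all $n > 24$, $\frac{1}{3} > \varepsilon > 0$, and all $m$ such that $L = \frac{m}{n} > \varepsilon^{-2}$ ($> 9$) we have

\[ P\left\{ \left| \sum_{i=1}^{n} \frac{k_i(x)(k_i(x) - 1)}{m(m-1)} \cdot \frac{1}{\|p\|^2} - 1 \right| < 22\varepsilon \right\} > 1 - \frac{10}{9} e^{-L\varepsilon^2}. \]

Note that the assumptions of this Corollary impose the following relation between $L$ and $\varepsilon$:
$\frac{1}{3} > \varepsilon > \frac{1}{\sqrt{L}}$; equivalently, $L = \frac{m}{n} > \varepsilon^{-2}$ ($> 9$).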
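To see how the constants in Theorem 1.4 and Corollary 1.6 behave numerically, the bounds can be evaluated for concrete parameters. This is a minimal sketch; the function names and the sample values of $n$, $\delta$, $\varepsilon$ and $L$ are assumptions chosen only for illustration.

```python
import math


def theorem_1_4_factor(s, eps, n, delta):
    """The factor 3 + 6s/n^(delta/2) + 5 s^2 eps / n^delta from Theorem 1.4."""
    return 3 + 6 * s / n ** (delta / 2) + 5 * s * s * eps / n ** delta


def corollary_1_6(load_factor, eps):
    """Return (bound on the relative error, lower bound on the probability)."""
    if not 0 < eps < 1 / 3:
        raise ValueError("Corollary 1.6 requires 0 < eps < 1/3")
    if load_factor <= eps ** -2:
        raise ValueError("Corollary 1.6 requires L > eps^(-2)")
    return 22 * eps, 1 - (10 / 9) * math.exp(-load_factor * eps ** 2)


# With s = 2 n^(delta/2) the factor collapses to 15 + 20 eps, which is < 22.
n, delta, eps = 100, 0.5, 0.1
s = 2 * n ** (delta / 2)
print(theorem_1_4_factor(s, eps, n, delta))
print(corollary_1_6(load_factor=100000, eps=0.01))
```

The second call shows the trade-off stated after Corollary 1.6: a small relative error $22\varepsilon$ is only available when the load factor is well above $\varepsilon^{-2}$.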

To compare with the result of Goldreich and Ron, let us pick $\varepsilon = n^{-\beta/2}$ in Corollary 1.5.
Then $n^{1/2+\beta+\lambda} = m = \varepsilon^{-2} n^{1+\delta}$ implies $\delta = \lambda - \frac{1}{2}$. Hence our Corollary becomes:

Corollary 1.7 For all $n > 24$, $\beta > 0$, $\lambda > \frac{1}{2}$, if $m = n^{1/2+\beta+\lambda}$ we have

\[ P\left\{ \left| \sum_{i=1}^{n} \frac{k_i(x)(k_i(x) - 1)}{m(m-1)} \cdot \frac{1}{\|p\|^2} - 1 \right| < 22\, n^{-\beta/2} \right\} > 1 - \frac{10}{9} e^{-n^{\lambda - 1/2}}. \]

Comparing Corollary 1.7 with the theorem of Goldreich and Ron: Our theorem gives a much better
deviation bound (it is exponential, as opposed to the polynomial bound of Goldreich and Ron);
but it applies only when the load factor $L$ is > 9 (whereas in the result of Goldreich and Ron,
the load factor $L = n^{\beta+\lambda-1/2}$ can be arbitrarily small, depending on $n$).

The average search time for a particular user

In order to analyze the efficiency of a hash table one considers the overall usage statistics of
the keys (over all users). By "user" we mean a person or a process. For every user we introduce
a vector $v = (v_1, \ldots, v_n)$, where $v_i$ is the frequency of the user's access (for search) to slot $i$.
More precisely, $v_i$ is the number of searches at slot $i$, divided by the total number of searches
in the table, for this user. Then $0 \leq v_i \leq 1$ and $\sum_{i=1}^{n} v_i = 1$. We shall call $v$ the user's access
pattern. Traditional analysis of the average search time assumes that the access pattern of a
user is the same as the key distribution (see e.g., [1]).

We let AST$(v, x)$ denote the average search time for a user with access pattern $v$, under the
condition that a sequence $x$ of $m$ independent keys was previously inserted into the hash table.
Clearly, we have the following upper bound:

\[ \mathrm{AST}(v, x) \leq \sum_{i=1}^{n} v_i \cdot k_i(x). \]

The difference between AST$(v, x)$ and $\sum_{i=1}^{n} v_i \cdot k_i(x)$ is caused by the possibility of
pseudo-collisions. Here we are only concerned with upper bounds on AST$(v, x)$, so we can use
$\sum_{i=1}^{n} v_i \cdot k_i(x)$.
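The upper bound $\sum_i v_i \cdot k_i(x)$ is easy to evaluate once the chain lengths are known. The sketch below assumes a hypothetical hash function `key % n + 1` and a user who searches only in slots 3 and 4; these choices are illustrative and not part of the text.

```python
from collections import Counter


def slot_loads(keys, h, n):
    """Return [k_1(x), ..., k_n(x)], the chain lengths counted with multiplicities."""
    counts = Counter(h(x) for x in keys)
    return [counts.get(i, 0) for i in range(1, n + 1)]


def ast_upper_bound(v, loads):
    """Upper bound sum_i v_i * k_i(x) on the average search time AST(v, x)."""
    return sum(vi * ki for vi, ki in zip(v, loads))


# Illustrative data: 6 inserted keys, 4 slots, a user who only visits slots 3 and 4.
n = 4
keys = [3, 7, 7, 10, 14, 21]
h = lambda key: key % n + 1
loads = slot_loads(keys, h, n)
v = [0.0, 0.0, 0.5, 0.5]
print(loads, ast_upper_bound(v, loads))
```

Here the user happens to visit the two longest chains, so the bound (2.5) exceeds the average chain length $m/n = 1.5$; this is the effect that the access pattern $v$ is meant to capture.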

We write $m$ as $m = Ln$, where $L$ is called the load factor. We do not assume that $L$ is a
constant. Applying Theorem 1.4 we show

Corollary 1.8 For all n > 24, s > 0, L > 9, and m = Ln we have 




P<|AST(t;,x)< LnHIHI Jl + 5^ + 5^ + 1 1 > 1 - ye^ 4 . 



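Noting that the square root in the bound of Corollary 1.8 is less than $1 + \frac{4s}{\sqrt{L}}$, and letting $\varepsilon = \frac{s}{2\sqrt{L}}$, we obtain

Corollary 1.9 For all $n > 24$, $\varepsilon > 0$, $L > 9$, and $m = Ln$ we have

\[ P\left\{ \mathrm{AST}(v, x) < L n \, \|p\| \, \|v\| \, (1 + 8\varepsilon) + 1 \right\} > 1 - \frac{10}{9} e^{-L\varepsilon^2}. \]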

One notices that the probability bound is only interesting when $L$ is significantly larger than
$\varepsilon^{-2}$. Also, the error bound is interesting only when $\varepsilon$ is less than 1/8; this means that the
load factor has to be at least 100 for our results to be interesting. In that sense, the results are
theoretical, and show just what type of behavior to expect, up to big-O.

In (chapt. 12, exercise 12-3) the expected search time (for every user) was found to 
be (L) , under the assumption that both the key distribution and the distribution of user's 
accesses are uniform. Our Corollary implies that if $\|p\|^2 = O\left(\frac{1}{n}\right)$ and $\|v\|^2 = O\left(\frac{1}{n}\right)$ (which
is much more relaxed than the assumption of a uniform distribution), then with exponentially
high probability, the average search time is $O(L)$ for a user with access pattern $v$.

Example 1 

Suppose that a hash table, designed for a certain population of users, has collision probability
satisfying $\|p\| \leq \frac{c}{\sqrt{n}}$ (for the overall population of users); $c$ is a positive constant. The keys in the hash
table are independent random samples. Now consider an individual user who accesses a subset
of cardinality $\alpha n$ (where $0 < \alpha < 1$) of the $n$ slots of the hash table, with uniform probability
$\frac{1}{\alpha n}$, and who does not access the other $(1 - \alpha) n$ slots of the hash table at all (i.e., those slots
have probability 0 for this user). Then the question is: What is the average search time for this
user and this table, and what is the deviation bound?

Since the user accesses a fraction $\alpha$ of the slots uniformly, we have $\|v\| = \frac{1}{\sqrt{\alpha n}}$. By Corollary 1.9,

\[ P\left\{ \mathrm{AST}(v, x) < \frac{cL}{\sqrt{\alpha}} (1 + 8\varepsilon) + 1 \right\} > 1 - \frac{10}{9} e^{-L\varepsilon^2}. \]

So, the average search time is at most $1 + \frac{cL}{\sqrt{\alpha}}$, with small error bound (namely $\frac{cL}{\sqrt{\alpha}} \cdot 8\varepsilon$), and
with probability close to 1 (namely $1 - \frac{10}{9} e^{-L\varepsilon^2}$).

One observes that when the fraction $\alpha$ of the table used by the user becomes smaller, the
upper bound on the average search time for this user increases, as does the error bound. This is
not surprising; hashing works best when the keys are spread over the table as evenly as possible.
Interestingly, our probability bound does not depend on $\alpha$.

Some possible numerical values: For $c = 5$, $\alpha = 0.1$, $\varepsilon = 0.05$, $L = 1000$, we get
AST$(v, x) < 15811 \pm 6324$, with probability at least $1 - \frac{10}{9} e^{-L\varepsilon^2} = 0.909$. For $c = 5$,
$\alpha = 0.1$, $\varepsilon = 0.05$, $L = 10000$, we get AST$(v, x) < (1.58 \pm 0.64) \cdot 10^5$, with probability at least
$1 - 1.54 \cdot 10^{-11}$.
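The numerical values above can be reproduced with a few lines of arithmetic. This is a small sketch; the function name `example_1` and its return convention are illustrative choices, and the additive term $+1$ of the bound is left out, as in the figures quoted above.

```python
import math


def example_1(c, alpha, eps, load_factor):
    """Return (main term cL/sqrt(alpha), error term 8*eps*main, failure probability)."""
    main = c * load_factor / math.sqrt(alpha)
    error = 8 * eps * main
    failure = (10 / 9) * math.exp(-load_factor * eps ** 2)
    return main, error, failure


for L in (1000, 10000):
    main, error, failure = example_1(c=5, alpha=0.1, eps=0.05, load_factor=L)
    print(f"L={L}: AST < {main:.0f} +/- {error:.0f}, probability at least 1 - {failure:.3g}")
```

For $L = 1000$ the failure probability is about 0.091, and for $L = 10000$ it drops to about $1.54 \cdot 10^{-11}$, which illustrates how quickly the Gaussian tail takes effect once $L\varepsilon^2$ grows.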

Example 2 

Let us consider the situation in which a query consists of two subqueries, $Q_1$ and $Q_2$. This
happens very commonly (e.g., in a "three-tier architecture"); see [3]. The two subqueries can
be viewed as two users with access patterns $v^{(1)}$ and $v^{(2)}$. Assume, for this example, that each
of $Q_1$ and $Q_2$ behaves like the user in Example 1 above. In particular, for $Q_i$ ($i = 1, 2$) we have
$\|v^{(i)}\| = \frac{1}{\sqrt{\alpha_i n}}$ and

\[ P\left\{ \mathrm{AST}_i(v^{(i)}, x) < \frac{cL}{\sqrt{\alpha_i}} (1 + 8\varepsilon) + 1 \right\} > 1 - \frac{10}{9} e^{-L\varepsilon^2}. \]

Hence, for the combined query the average search time is a weighted sum

\[ \mathrm{AST} = w_1 \cdot \mathrm{AST}_1 + w_2 \cdot \mathrm{AST}_2, \quad \text{with } w_1 + w_2 = 1. \]

Let $a_i = \frac{cL}{\sqrt{\alpha_i}} (1 + 8\varepsilon) + 1$. Since AST is a weighted average of $\mathrm{AST}_1$ and $\mathrm{AST}_2$, it is at most
$\max\{\mathrm{AST}_1, \mathrm{AST}_2\}$. Then

\[ P\{\mathrm{AST} < \max\{a_1, a_2\}\} \geq P\{\mathrm{AST}_1 < \max\{a_1, a_2\}, \ \mathrm{AST}_2 < \max\{a_1, a_2\}\} > 1 - 2 \cdot \frac{10}{9} e^{-L\varepsilon^2}. \]

Therefore, the average search time AST$(v^{(1)}, v^{(2)}, x)$ of the combined query satisfies

\[ P\left\{ \mathrm{AST}(v^{(1)}, v^{(2)}, x) < \frac{cL}{\sqrt{\min\{\alpha_1, \alpha_2\}}} (1 + 8\varepsilon) + 1 \right\} > 1 - \frac{20}{9} e^{-L\varepsilon^2}. \]

Hence, when the load factor is large (compared to $\varepsilon^{-2}$) we obtain a very reliable upper bound on
the average search time for the combined query. The knowledge of this upper bound enables
various processes (that wait for the completion of this query) to be scheduled in a predictable
way.
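The combined bound of Example 2 amounts to taking the worst single-user bound and applying a union bound to the failure probabilities. A minimal sketch follows; the function names and the sample fractions $\alpha_1 = 0.1$, $\alpha_2 = 0.4$ are assumptions made for illustration.

```python
import math


def single_bound(c, alpha, eps, load_factor):
    """Bound a_i = cL/sqrt(alpha_i) * (1 + 8 eps) + 1 for one subquery."""
    return c * load_factor / math.sqrt(alpha) * (1 + 8 * eps) + 1


def combined_query_bound(c, alphas, eps, load_factor):
    """Return (bound for the combined query, union bound on the failure probability)."""
    bound = single_bound(c, min(alphas), eps, load_factor)  # equals max of the a_i
    failure = len(alphas) * (10 / 9) * math.exp(-load_factor * eps ** 2)
    return bound, failure


bound, failure = combined_query_bound(c=5, alphas=(0.1, 0.4), eps=0.05, load_factor=10000)
print(bound, failure)
```

The subquery that touches the smaller fraction of the table determines the bound, while the failure probability only doubles.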

The constants in our results are rather large. This is due to the generality of our results.
In a precise practical situation, our results could be used for the form of the probabilistic
behavior, with constants to be determined empirically.

The next section contains the proofs of our theorems. 



2 Proofs 

2.1 A deviation bound for the empirical collision probability: Proof
of Theorem 1.4

Our main technique will be Talagrand's isoperimetric theory, developed by Talagrand in the
mid 1990s [6]. It has had a profound impact on the probabilistic theory of combinatorial
optimization [3] (see Sections 6 - 13 of [6] and chapter 6 of [5]).

Let $(\Omega, \mu)$ be a probability space, and let $(\Omega^m, \mu^m)$ be the product space. For $x \in \Omega^m$ and
$A \subseteq \Omega^m$, Talagrand's convex distance $d_T(x, A)$ is defined by

\[ d_T(x, A) = \sup\left\{ z_\alpha = \inf_{y \in A} \sum_{j=1}^{m} \alpha_j \, 1(x_j \neq y_j) \ : \ \alpha = (\alpha_1, \ldots, \alpha_m), \ \sum_{j=1}^{m} \alpha_j^2 \leq 1 \right\}, \]

where $x = (x_1, \ldots, x_m)$, $y = (y_1, \ldots, y_m)$. Here, $1(x_j \neq y_j) = 1$ if $x_j \neq y_j$, and it is 0 otherwise.
Theorem 2.1 (Talagrand 1995) For every $A \subseteq \Omega^m$ with $\mu^m(A) > 0$, we have

\[ \int_{\Omega^m} \exp\left( \tfrac{1}{4} d_T(x, A)^2 \right) d\mu^m(x) \leq \frac{1}{\mu^m(A)}, \]

and consequently, we have for all $s > 0$,

\[ P\{ d_T(x, A) \geq s \} \leq \frac{e^{-s^2/4}}{\mu^m(A)}. \]

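To apply Talagrand's theorem to our situation we define a set $A \subseteq U^m$ by

\[ A = \left\{ y \in U^m : \left| \sum_{i=1}^{n} \frac{k_i(y)(k_i(y) - 1)}{m(m-1)} \cdot \frac{1}{\|p\|^2} - 1 \right| \leq 3\varepsilon \right\}. \]

Lemma 2.2 For all $n > 24$ we have $P(A) > \frac{9}{10}$.

Proof. Recall that $m = \varepsilon^{-2} n^{1+\delta}$ with $\frac{1}{3} > \varepsilon > 0$, $\delta > 0$. Letting $\beta = -\frac{2 \log \varepsilon}{\log n}$ and $\lambda = 1/2 + \delta$,
we rewrite $m$ as $n^{1/2+\beta+\lambda}$. Then the lemma follows from Theorem 1.3 of Goldreich and Ron. □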



For every $s > 0$ we define a set $C_s \subseteq U^m$ by

\[ C_s = \{ x \in U^m : d_T(x, A) < s \}. \]

By Theorem 2.1 and Lemma 2.2 we have for all $n > 24$ and all $s > 0$

\[ P(C_s) > 1 - \frac{10}{9} e^{-s^2/4}. \tag{1} \]

Lemma 2.3 For every $x = (x_1, \ldots, x_m) \in C_s$ there is $y = (y_1, \ldots, y_m) \in A$ such that

\[ \sum_{j=1}^{m} 1(x_j \neq y_j) < s \, m^{1/2}. \]

Proof. Assume, by contradiction, that there is $x \in C_s$ such that for all $y \in A$,

\[ \sum_{j=1}^{m} 1(x_j \neq y_j) \geq s \, m^{1/2}. \]

Now, if we take $\alpha = (\alpha_1, \ldots, \alpha_m) = (m^{-1/2}, \ldots, m^{-1/2})$ in the definition of the Talagrand
distance $d_T$, the inequality above implies $d_T(x, A) \geq s$. But since $x \in C_s$, we also have
$d_T(x, A) < s$, a contradiction. □
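The proof of Lemma 2.3 only uses the uniform weight vector $\alpha = (m^{-1/2}, \ldots, m^{-1/2})$, which turns the Hamming distance to $A$ into a lower bound on $d_T(x, A)$. The sketch below illustrates that lower bound for a small, explicitly listed set $A$; the data and function names are illustrative assumptions, and the code does not compute $d_T$ itself (that would require optimizing over all weight vectors).

```python
import math


def hamming(x, y):
    """Number of positions j with x_j != y_j."""
    return sum(1 for a, b in zip(x, y) if a != b)


def uniform_weight_lower_bound(x, A):
    """Value of z_alpha for alpha = (m^-1/2, ..., m^-1/2); a lower bound on d_T(x, A)."""
    m = len(x)
    return min(hamming(x, y) for y in A) / math.sqrt(m)


# Illustrative data: A is given as an explicit list of sequences of length m = 4.
x = (1, 2, 3, 4)
A = [(1, 2, 0, 0), (0, 0, 0, 0)]
print(uniform_weight_lower_bound(x, A))
```

Read in the other direction, this is the statement of the lemma: if $d_T(x, A) < s$, then the closest $y \in A$ differs from $x$ in fewer than $s\,m^{1/2}$ positions.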

Recall that for any $x = (x_1, \ldots, x_m)$, $y = (y_1, \ldots, y_m) \in U^m$, we defined $k_i(x)$ (resp. $k_i(y)$) to
be the number of keys (with multiplicity) that are hashed into the slot $i$ for input sequence
$x$, resp. $y$. We define integers $s_i$ ($1 \leq i \leq n$) by

\[ k_i(x) = k_i(y) + s_i. \]
Lemma 2.4 For all $x, y \in U^m$,

\[ \sum_{i=1}^{n} |s_i| \leq 2 \sum_{j=1}^{m} 1(x_j \neq y_j). \]

Proof. We prove the lemma by induction on $\sum_{j=1}^{m} 1(x_j \neq y_j)$.

(0) $\sum_{j=1}^{m} 1(x_j \neq y_j) = 0$:
Then we have $x_j = y_j$ for all $j = 1, \ldots, m$, and hence, $k_i(x) = k_i(y)$ for all $i = 1, \ldots, n$. Thus,
we have $\sum_{i=1}^{n} |s_i| = 0$, finishing the base case.

(Inductive step) Assume $\sum_{j=1}^{m} 1(x_j \neq y_j) > 0$.
Without loss of generality we assume that $x_m \neq y_m$. Now, consider $\tilde{x} = (x_1, \ldots, x_{m-1}, y_m)$. We
write $k_i(\tilde{x}) = k_i(y) + \tilde{s}_i$ for $i = 1, \ldots, n$. By the induction hypothesis we have

\[ \sum_{i=1}^{n} |\tilde{s}_i| \leq 2 \sum_{j=1}^{m} 1(\tilde{x}_j \neq y_j). \tag{2} \]

Since $\tilde{x}$ differs from $x$ only in its last component, we either have $h(x_m) = h(y_m)$, in which case
$s_i = \tilde{s}_i$ for all $i = 1, \ldots, n$; or we have $h(x_m) \neq h(y_m)$; let $i_1 = h(x_m)$ and $i_2 = h(y_m)$. Then
$s_{i_1} = \tilde{s}_{i_1} + 1$, $s_{i_2} = \tilde{s}_{i_2} - 1$, and $s_i = \tilde{s}_i$ for all $i \in \{1, \ldots, n\} \setminus \{i_1, i_2\}$. In both cases,

\[ \sum_{i=1}^{n} |s_i| - \sum_{i=1}^{n} |\tilde{s}_i| \leq 2. \tag{3} \]

On the other hand,

\[ \sum_{j=1}^{m} 1(x_j \neq y_j) = \sum_{j=1}^{m} 1(\tilde{x}_j \neq y_j) + 1. \]

Combining this, (2), and (3), completes the proof for the inductive step. □
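Lemma 2.4 is easy to sanity-check empirically: changing one key moves at most one unit of load out of one slot and into another. The following sketch tests the inequality on random pairs of key sequences; the hash function `key % n + 1`, the sequence lengths and the random seed are assumptions made for illustration, and a passing check is of course not a proof.

```python
import random
from collections import Counter


def loads(keys, h, n):
    """Return [k_1, ..., k_n] for the given key sequence."""
    counts = Counter(h(k) for k in keys)
    return [counts.get(i, 0) for i in range(1, n + 1)]


def hamming(x, y):
    return sum(1 for a, b in zip(x, y) if a != b)


def lemma_2_4_holds(x, y, h, n):
    """Check sum_i |s_i| <= 2 * sum_j 1(x_j != y_j), where s_i = k_i(x) - k_i(y)."""
    s = [kx - ky for kx, ky in zip(loads(x, h, n), loads(y, h, n))]
    return sum(abs(si) for si in s) <= 2 * hamming(x, y)


def random_check(trials=200, m=30, n=5, universe=50, seed=0):
    rng = random.Random(seed)
    h = lambda key: key % n + 1
    for _ in range(trials):
        x = [rng.randrange(universe) for _ in range(m)]
        # y agrees with x in most positions and is resampled in the others
        y = [xj if rng.random() < 0.7 else rng.randrange(universe) for xj in x]
        if not lemma_2_4_holds(x, y, h, n):
            return False
    return True


print(random_check())
```

The bound is attained, for instance, by two one-element sequences whose keys hash to different slots.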

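Lemma 2.5 For every $x \in C_s$ there is $y \in A$ such that for all $n > 24$, $0 < \varepsilon < 1/3$, $s > 0$,
and $m = \varepsilon^{-2} n^{1+\delta}$, we have

\[ \left| \sum_{i=1}^{n} \frac{k_i(x)(k_i(x) - 1)}{m(m-1)} - \sum_{i=1}^{n} \frac{k_i(y)(k_i(y) - 1)}{m(m-1)} \right| < \varepsilon \, \|p\|^2 \left( \frac{6s}{n^{\delta/2}} + \frac{5 s^2 \varepsilon}{n^{\delta}} \right). \]

Proof. For any fixed $x \in C_s$ we take $y \in A$ according to Lemma 2.3. That is,

\[ \sum_{j=1}^{m} 1(x_j \neq y_j) < s \, m^{1/2}. \tag{4} \]

As in the proof of Lemma 2.4 we use the notation $k_i(x)$, $k_i(y)$, and $s_i$ ($i = 1, \ldots, n$). We will
leave the common denominator $m(m-1)$ out of the computations until the end: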
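\[
\begin{aligned}
& \Big| \sum_{i=1}^{n} k_i(x)(k_i(x) - 1) - \sum_{i=1}^{n} k_i(y)(k_i(y) - 1) \Big| \\
&= \Big| \sum_{i=1}^{n} (k_i(y) + s_i)(k_i(y) + s_i - 1) - \sum_{i=1}^{n} k_i(y)(k_i(y) - 1) \Big| \\
&= \Big| \sum_{1 \leq i \leq n,\ k_i(y) \geq 1} \big[ (k_i(y) + s_i)(k_i(y) + s_i - 1) - k_i(y)(k_i(y) - 1) \big] + \sum_{1 \leq i \leq n,\ k_i(y) = 0} s_i (s_i - 1) \Big| \\
&\leq \sum_{1 \leq i \leq n,\ k_i(y) \geq 1} 2 |s_i| (k_i(y) - 1) + \sum_{1 \leq i \leq n,\ k_i(y) \geq 1} (s_i^2 + |s_i|) + \Big| \sum_{1 \leq i \leq n,\ k_i(y) = 0} s_i (s_i - 1) \Big| \\
&\leq 2 \sum_{1 \leq i \leq n,\ k_i(y) \geq 1} |s_i| (k_i(y) - 1) + \sum_{i=1}^{n} (s_i^2 + |s_i|).
\end{aligned}
\]

By the Cauchy-Schwarz inequality, and since $(k - 1)^2 \leq k(k - 1)$ for $k \geq 1$, this is bounded by

\[
\begin{aligned}
&\leq 2 \Big( \sum_{1 \leq i \leq n,\ k_i(y) \geq 1} s_i^2 \Big)^{1/2} \Big( \sum_{1 \leq i \leq n,\ k_i(y) \geq 1} (k_i(y) - 1)^2 \Big)^{1/2} + \sum_{i=1}^{n} (s_i^2 + |s_i|) \\
&\leq 2 \Big( \sum_{i=1}^{n} s_i^2 \Big)^{1/2} \Big( \sum_{i=1}^{n} k_i(y)(k_i(y) - 1) \Big)^{1/2} + \sum_{i=1}^{n} (s_i^2 + |s_i|).
\end{aligned}
\]

By Lemma 2.4 and (4) we have

\[ \sum_{i=1}^{n} s_i^2 \leq \Big( \sum_{i=1}^{n} |s_i| \Big)^2 \leq \Big( 2 \sum_{j=1}^{m} 1(x_j \neq y_j) \Big)^2 < 4 s^2 m. \tag{5} \]

Since $y \in A$ we have

\[ \sum_{i=1}^{n} \frac{k_i(y)(k_i(y) - 1)}{m(m-1)} \leq \|p\|^2 (1 + 3\varepsilon). \]

Hence, by all the above:

\[ \left| \sum_{i=1}^{n} \frac{k_i(x)(k_i(x) - 1)}{m(m-1)} - \sum_{i=1}^{n} \frac{k_i(y)(k_i(y) - 1)}{m(m-1)} \right| < \frac{4 s \, \|p\| \, (1 + 3\varepsilon)^{1/2}}{(m-1)^{1/2}} + \frac{4 s^2}{m - 1} + \frac{2 s}{m^{1/2} (m - 1)}. \]

By calculating, and using the fact that $\|p\|^2 \geq \frac{1}{n}$, $0 < \varepsilon < 1/3$, and $m = \varepsilon^{-2} n^{1+\delta}$, we find the
following upper bound for $\left| \sum_{i=1}^{n} \frac{k_i(x)(k_i(x)-1)}{m(m-1)} - \sum_{i=1}^{n} \frac{k_i(y)(k_i(y)-1)}{m(m-1)} \right|$:

\[ \varepsilon \, \|p\|^2 \left( \frac{4 s (1 + 3\varepsilon)^{1/2} n^{1/2}}{(n^{1+\delta} - \varepsilon^2)^{1/2}} + \frac{4 s^2 \varepsilon \, n}{n^{1+\delta} - \varepsilon^2} + \frac{2 s \varepsilon^2 \, n^{(1-\delta)/2}}{n^{1+\delta} - \varepsilon^2} \right). \]

Combining this and using $n > 24$ we obtain the upper bound

\[ \varepsilon \, \|p\|^2 \left( \frac{6s}{n^{\delta/2}} + \frac{5 s^2 \varepsilon}{n^{\delta}} \right). \]

□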



Proof of Theorem 1.4. The theorem follows from the definition of $A$, inequality (1) and
Lemma 2.5. □
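2.2 Average search time for a particular user

Proof of Corollary 1.8. Recall that the average search time AST$(v, x)$ is bounded from
above by $\sum_{i=1}^{n} v_i \cdot k_i(x)$. In Theorem 1.4 let us write $m = L_1 L_2 n$, and choose

\[ \varepsilon = \frac{1}{\sqrt{L_1}} \quad \text{and} \quad \delta = \frac{\log L_2}{\log n}. \]

Note that for all $i$,

\[ k_i(x) - 1 \leq \sqrt{k_i(x)(k_i(x) - 1)} \]

since the left side is $\leq 0$ when $k_i(x) = 0$ or $1$. Therefore,

\[
\begin{aligned}
\mathrm{AST}(v, x) &\leq \sum_{i=1}^{n} v_i \cdot k_i(x) = \sum_{i=1}^{n} v_i (k_i(x) - 1) + 1 \leq \sum_{i=1}^{n} v_i \sqrt{k_i(x)(k_i(x) - 1)} + 1 \\
&\leq \sqrt{\sum_{i=1}^{n} v_i^2} \ \sqrt{\sum_{i=1}^{n} k_i(x)(k_i(x) - 1)} + 1 = \|v\| \sqrt{m(m-1)} \ \sqrt{\sum_{i=1}^{n} \frac{k_i(x)(k_i(x) - 1)}{m(m-1)}} + 1.
\end{aligned}
\]

The corollary follows from this and Theorem 1.4. □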
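The chain of inequalities in this proof can be checked numerically for any access pattern $v$ with $\sum_i v_i = 1$ and any vector of chain lengths. This is a minimal sketch; the function names, the random data and the seed are illustrative assumptions.

```python
import math
import random


def chain_values(v, loads):
    """Return (sum_i v_i k_i, ||v|| * sqrt(sum_i k_i (k_i - 1)) + 1)."""
    lhs = sum(vi * ki for vi, ki in zip(v, loads))
    norm_v = math.sqrt(sum(vi * vi for vi in v))
    rhs = norm_v * math.sqrt(sum(k * (k - 1) for k in loads)) + 1
    return lhs, rhs


def random_chain_check(trials=500, n=8, max_load=20, seed=1):
    rng = random.Random(seed)
    for _ in range(trials):
        raw = [rng.random() for _ in range(n)]
        total = sum(raw)
        v = [r / total for r in raw]  # access pattern, sums to 1
        loads = [rng.randint(0, max_load) for _ in range(n)]
        lhs, rhs = chain_values(v, loads)
        if lhs > rhs + 1e-9:  # small tolerance for floating-point rounding
            return False
    return True


print(chain_values([0.0, 0.0, 0.5, 0.5], [0, 1, 2, 3]))
print(random_chain_check())
```

The additive 1 comes from $\sum_i v_i = 1$; without that normalization of $v$ the first equality of the chain would not hold.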

Remark. Our proof method depends crucially on Talagrand's theorem. Many readers, more
familiar with techniques like the Chernoff bound, or more generally, the Hoeffding inequality for
martingale differences (from which the Chernoff bound follows directly), may wonder whether
these simpler techniques would work here. In order to apply Hoeffding's inequality we could
view $\sum_{i=1}^{n} v_i \cdot k_i(x)$ as a weighted sum of the random variables $k_i(x)$; to apply Hoeffding one
needs to bound $k_i(x)$, but we don't have good bounds a priori; finding good bounds on $k_i(x)$
seems harder and less promising than our method, based on Talagrand's theorem. See, e.g.,
Michael Steele's book [5], which discusses the advantages of applying Talagrand's theorem at
length.